Today, we will focus on detecting violations of standard linear regression assumptions such as normality, homoscedasticity, and linearity. We describe methods to address these violations through data transformations and other methods. In addition to testing these model assumptions, we will discuss other data problems that adversely affect regression results: outliers, influential observations, and multi-collinearity.
The following assumptions underlie the regression model, in particular about the random errors:
- Linearity: the relationship between the dependent variable and the independent variables is linear.
- Independence: the errors are independent of each other.
- Homoscedasticity: the errors have constant variance for any value of the independent variables.
- Normality: the errors are normally distributed.
If you have not installed the packages, please run this command first:
#install.packages(c("dplyr", "MASS", "car"))
We start by loading the libraries:
library(dplyr)
library(MASS)
library(car)
rm(list = ls())
set.seed(40)
We import the data, convert the categorical columns to factors, and scale the numeric predictors.
data <- read.csv("insurance.csv")
data$gender <- as.factor(data$gender)
data$gender <- relevel(data$gender, ref = "male")
data$smoker <- as.factor(data$smoker)
data$smoker <- relevel(data$smoker, ref = "no")
data$region <- as.factor(data$region)
data <- data %>% mutate_if(function(col) is.numeric(col) & !all(col == .$charges), scale)
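The mutate_if call standardizes every numeric column except the response charges (the !all(col == .$charges) predicate skips it). A minimal sketch of the same idea in base R, on a small synthetic data frame (the values are illustrative):

```r
# Illustrative sketch (synthetic data): standardize every numeric column
# except the response, as the mutate_if call above does for `charges`.
df <- data.frame(age     = c(23, 45, 31, 60),
                 bmi     = c(21.5, 30.2, 27.8, 24.9),
                 charges = c(1200, 5400, 3300, 8100))
num_cols <- setdiff(names(df)[sapply(df, is.numeric)], "charges")
df[num_cols] <- lapply(df[num_cols], function(col) as.numeric(scale(col)))
colMeans(df[num_cols])   # scaled columns now have mean ~0 and sd 1
```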
Now, we create the regression model:
model <- lm(charges ~ age + bmi + children + gender + smoker + region, data)
summary(model)
##
## Call:
## lm(formula = charges ~ age + bmi + children + gender + smoker +
## region, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8922.2 388.9 22.939 < 2e-16 ***
## age 3608.8 167.2 21.587 < 2e-16 ***
## bmi 2068.5 174.4 11.860 < 2e-16 ***
## children 573.2 166.1 3.451 0.000577 ***
## genderfemale 131.3 332.9 0.394 0.693348
## smokeryes 23848.5 413.2 57.723 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.1 477.9 -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
You will run the following commands using the Wine Quality dataset. The goal is to predict a wine’s quality from its properties. The data comes from [Cortez et al., 2009].
wine.data <- read.csv("winequality-red.csv")
wine.data <- wine.data %>% mutate_if(function(col) is.numeric(col) & !all(col == .$quality), scale)
wine.model <- lm(quality ~ ., wine.data)
summary(wine.model)
##
## Call:
## lm(formula = quality ~ ., data = wine.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68911 -0.36652 -0.04699 0.45202 2.02498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.63602 0.01621 347.788 < 2e-16 ***
## fixed_acidity 0.04351 0.04518 0.963 0.3357
## volatile_acidity -0.19403 0.02168 -8.948 < 2e-16 ***
## citric_acid -0.03556 0.02867 -1.240 0.2150
## residual_sugar 0.02303 0.02115 1.089 0.2765
## chlorides -0.08821 0.01973 -4.470 8.37e-06 ***
## free_sulfur_dioxide 0.04562 0.02271 2.009 0.0447 *
## total_sulfur_dioxide -0.10739 0.02397 -4.480 8.00e-06 ***
## density -0.03375 0.04083 -0.827 0.4086
## pH -0.06386 0.02958 -2.159 0.0310 *
## sulphates 0.15533 0.01938 8.014 2.13e-15 ***
## alcohol 0.29433 0.02822 10.429 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared: 0.3606, Adjusted R-squared: 0.3561
## F-statistic: 81.35 on 11 and 1587 DF, p-value: < 2.2e-16
For any fixed value of X, Y is normally distributed.
We need to check the errors between the estimated and real Ys. To test the normality assumption on the errors, the recommended method is the normal quantile-quantile plot (called the normal Q-Q plot or simply the normal plot) of the residuals. This is a plot of the quantiles of the residuals versus theoretical standard normal distribution quantiles. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight.
plot(model, which=2)
We see that the normal plot for charges shows extreme departures in the upper tail. This plot reveals the presence of outliers, which are causing the non-normality.
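The which=2 panel is just a normal Q-Q plot of the standardized residuals; it can be reproduced by hand with qqnorm() and qqline(). A sketch on a simulated, well-behaved model (data and names are illustrative):

```r
# Illustrative sketch (simulated data): draw the Q-Q plot manually
# with qqnorm() and qqline() on the standardized residuals.
set.seed(1)
x   <- rnorm(200)
y   <- 2 + 3 * x + rnorm(200)   # well-behaved, normal errors
toy <- lm(y ~ x)
r   <- rstandard(toy)
qqnorm(r)
qqline(r)   # with normal errors, the points hug this line
```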
hist(data$charges)
Data is highly concentrated on the left side of the plot. The log transformation might help to mitigate the outliers.
hist(log(data$charges))
Now, charges looks more normally distributed. We can make a logarithmic transformation in the model to solve this issue and mitigate the outliers.
log.model <- lm(log(charges) ~ age + bmi + children + gender + smoker + region, data)
plot(log.model, which=2)
We see that the normal plot for log(charges) shows much less extreme departures in the upper tail than the normal plot for charges did. If the remaining outliers were excluded, the normal plot would be even more linear.
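As a complement to the visual check, a formal normality test such as shapiro.test() can be applied to the residuals (this test is not part of the original walkthrough). A sketch on simulated data where the response is log-normal by construction:

```r
# Illustrative sketch (simulated data): shapiro.test() gives a formal
# normality test of the residuals, complementing the visual Q-Q check.
set.seed(2)
x        <- runif(150)
y_skewed <- exp(2 + x + rnorm(150))   # log-normal response
p_raw <- shapiro.test(resid(lm(y_skewed ~ x)))$p.value
p_log <- shapiro.test(resid(lm(log(y_skewed) ~ x)))$p.value
c(raw = p_raw, log = p_log)   # the log model's residuals look far more normal
```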
Check the Q-Q plot of the wine.model object. Should we transform the data?
plot(wine.model, which=2)
Print the histogram of the quality variable.
hist(wine.data$quality)
The variance of the residuals is the same for any value of X.
One assumption of linear regressions is that the variance of the errors is the same across observations, and in particular does not depend on the values of the explanatory variables. Violation of the homoscedasticity assumption is a more serious problem than non-normality and can lead to invalid inferences on the regression coefficients.
Heteroscedasticity occurs more often in datasets that have a large range between the largest and smallest observed values. While there are numerous reasons why heteroscedasticity can exist, a common explanation is that the error variance changes proportionally with some factor. In some cases, the variance grows with this factor but remains constant as a percentage. For instance, a 10% change in a small number such as 100 is much smaller in absolute terms than a 10% change in a large number such as 100,000. In this scenario, we expect to see larger residuals associated with higher values.
To test homoscedasticity, we plot the raw residuals against the fitted values. This plot is called the fitted values plot. If the residuals spread out evenly, forming a roughly parallel band around the zero line, this indicates that the errors’ standard deviation is constant, supporting the homoscedasticity assumption.
plot(model, which=1)
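A formal test can back up the visual check. The sketch below (simulated data, not from the lab) hand-rolls a Breusch-Pagan-style score test; since car is already loaded, car::ncvTest(model) provides a ready-made alternative.

```r
# Illustrative sketch (simulated data): a hand-rolled Breusch-Pagan-style
# score test for non-constant error variance.
set.seed(3)
x   <- runif(300, 1, 10)
y   <- 2 + x + rnorm(300, sd = 0.5 * x)   # error sd grows with x
fit <- lm(y ~ x)
# Regress the squared residuals on the fitted values; under homoscedasticity,
# n * R^2 of this auxiliary regression is approximately chi-squared(1).
aux     <- lm(resid(fit)^2 ~ fitted(fit))
bp_stat <- length(y) * summary(aux)$r.squared
p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
p_value   # a small p-value flags heteroscedasticity
```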
To address the heteroscedasticity problem, we can transform the dependent variable (Y) toward a more symmetric, normal-like shape. To determine which transformation the dependent variable should take, we use the Box-Cox transformation. Based on the estimated value of \(\lambda\), we determine the required transformation for the dependent variable:
As a result, the dependent variable may require a logarithmic, square root, or inverse transformation. These are the most common Box-Cox transformations:
\(\lambda = -2\): \(1/Y^2\)
\(\lambda = -1\): \(1/Y\)
\(\lambda = -0.5\): \(1/\sqrt{Y}\)
\(\lambda = 0\): \(\log(Y)\)
\(\lambda = 0.5\): \(\sqrt{Y}\)
\(\lambda = 1\): \(Y\) (no transformation)
\(\lambda = 2\): \(Y^2\)
R automatically plots the log-likelihood as a function of possible \(\lambda\) values. The plot indicates the value that maximizes the log-likelihood, together with a 95% confidence interval for it.
boxcox(model, plotit = TRUE)
The value of \(\lambda\) is very close to zero (0). Using the logarithmic transformation is therefore valid, and it keeps the residuals’ variance constant across the fitted values.
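Instead of reading \(\lambda\) off the plot, it can be extracted from the list that MASS::boxcox() returns. A sketch on simulated data whose true \(\lambda\) is 0 by construction:

```r
# Illustrative sketch (simulated data): pull the best lambda out of the
# list that MASS::boxcox() returns instead of reading it off the plot.
library(MASS)
set.seed(4)
x  <- runif(500)
y  <- exp(1 + 2 * x + rnorm(500, sd = 0.3))   # log(y) is linear in x
bc <- boxcox(lm(y ~ x), lambda = seq(-1, 1, 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat   # should land near 0, pointing to the log transformation
```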
If we want to be more specific…
boxcox(model, lambda = seq(0, 0.2, by = 0.05), plotit = TRUE)
According to this result, the transformation for charges should be \((y ^{0.14} - 1)/0.14\). We can run this new transformation and observe the results.
boxcox.model = lm((((charges ^ 0.14) - 1) / 0.14) ~ age + gender + bmi + children + smoker + region, data)
summary(boxcox.model)
##
## Call:
## lm(formula = (((charges^0.14) - 1)/0.14) ~ age + gender + bmi +
## children + smoker + region, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4148 -0.7527 -0.2352 0.1724 7.7198
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.59378 0.10043 175.187 < 2e-16 ***
## age 1.64292 0.04317 38.060 < 2e-16 ***
## genderfemale 0.22834 0.08597 2.656 0.008000 **
## bmi 0.33136 0.04503 7.358 3.25e-13 ***
## children 0.39012 0.04289 9.095 < 2e-16 ***
## smokeryes 5.81891 0.10668 54.546 < 2e-16 ***
## regionnorthwest -0.21254 0.12298 -1.728 0.084166 .
## regionsoutheast -0.51838 0.12360 -4.194 2.92e-05 ***
## regionsouthwest -0.43433 0.12341 -3.519 0.000447 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.565 on 1329 degrees of freedom
## Multiple R-squared: 0.7761, Adjusted R-squared: 0.7748
## F-statistic: 575.9 on 8 and 1329 DF, p-value: < 2.2e-16
We can plot the residuals of the transformed model against its fitted values.
plot(boxcox.model, which=1)
Plot the residuals vs fitted values of the wine.model. Use the function plot with the argument which=1. What can you see?
plot(wine.model, which=1)
Run the Box-Cox procedure to check which transformation, if any, is required.
boxcox(wine.model, plotit = TRUE)
The value is close to one (1), which means that no transformation is required (\((y^1-1)/1 \sim y\)).
Outliers are observations that deviate significantly from the fitted model, e.g., by more than two or three standard deviations. An outlier has a large residual (the distance between the predicted value and the observed value). Outliers weaken the fit of a statistical model because they do not coincide with the model’s predictions.
Outliers should not be deleted without additional inspection. First, they must be checked for validity and should be deleted only if they are erroneous. If they are valid observations, they may indicate model misspecification. For example, we may be fitting a straight line to data that actually follow a quadratic or an exponential model. Thus an outlier can be useful for revealing a misspecified model.
Let’s first check the original model’s residuals.
model.res = resid(model)
We can plot the residuals of each observation.
plot(c(1:nrow(data)), model.res, ylab="Residuals", xlab="Observation", main="Residual Plot")
Now, we check the residuals of the logarithmic model:
log.model.res = resid(log.model)
We plot each observation’s residual.
plot(c(1:nrow(data)), log.model.res, ylab="Residuals", xlab="Observation", main="Residual Plot")
A rule of thumb is to consider as outliers observations whose standardized residuals exceed 2 in absolute value. Here we apply the same threshold to the residuals of the logarithmic model:
log.model.res[abs(log.model.res) > 2]
## 220 431 517 1028
## 2.116146 2.120903 2.166365 2.043193
These observations are outliers in the right tail of the charges distribution. We can check this with the Q-Q plot too.
plot(log.model, which=2,cex.id = 2)
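Note that the filter above applies the threshold to the raw residuals of the log model. Strictly, the rule of thumb refers to standardized residuals, which R computes with rstandard(). A sketch on simulated data with one planted outlier:

```r
# Illustrative sketch (simulated data): rstandard() returns internally
# studentized residuals, the quantity the |residual| > 2 rule refers to.
set.seed(5)
x       <- rnorm(100)
y       <- 1 + x + rnorm(100)
y[100]  <- y[100] + 6           # plant one gross outlier
fit     <- lm(y ~ x)
std_res <- rstandard(fit)
which(abs(std_res) > 2)         # the planted observation is flagged
```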
Check for outliers in the wine.data. Calculate and assign to the variable wine.model.residuals the residuals of the model wine.model using the function resid(). Then, plot the residuals against the wine.data observations.
wine.model.residuals = resid(wine.model)
plot(c(1:nrow(wine.data)), wine.model.residuals, ylab="Residuals", xlab="Observation", main="Residual Plot")
Check for potential outliers using the rule of thumb (std. residuals bigger than 2). How many potential outliers do you see?
wine.model.residuals[abs(wine.model.residuals) > 2]
## 46 391 460 653 833 900 1277 1479
## -2.000036 2.024975 -2.061787 -2.474653 -2.689107 -2.137879 -2.327958 -2.132611
## 1506
## -2.223015
The idea of fitting a model is to capture the overall pattern of variation in the response variable as a function of the predictor variables. So the fit of the model should be determined by the majority of the data and not by a few so-called influential observations (also called high leverage observations).
An influential observation is any observation that has a large effect on the slope of a regression line fitting the data. They are generally extreme values. The process to identify an influential observation begins by removing the suspected influential point from the data set. If this removal significantly changes the slope of the regression line, then the point is considered an influential observation.
A common measure of influence is Cook’s distance, which is a function of the standardized residuals. Cook’s distance for the ith observation measures the effect of deleting that observation on the fitted values of all observations.
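In formulas, Cook’s distance for observation \(i\) is

\[
D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{p \, \hat{\sigma}^2} = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}},
\]

where \(\hat{y}_{j(i)}\) are the fitted values computed with observation \(i\) deleted, \(p\) is the number of estimated coefficients, \(\hat{\sigma}^2\) is the estimated error variance, \(r_i\) is the standardized residual, and \(h_{ii}\) is the leverage of observation \(i\).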
plot(log.model, which=4)
A rule of thumb is to remove those observations with a Cook’s distance higher than 4/(number of observations).
log.model.cook.distances <- cooks.distance(log.model)
influential_obs <- log.model.cook.distances[(log.model.cook.distances) > 4/nrow(data)]
influential_obs <- as.numeric(names(influential_obs))
We create the model without the influential observations
data.without.influential <- data[-influential_obs,]
log.model.no.outliers <- lm(log(charges) ~ age + bmi + children + gender + smoker + region, data.without.influential)
We plot the distribution of these values without the influential observations.
hist(log(data.without.influential$charges))
We print the model’s results
summary(log.model.no.outliers)
##
## Call:
## lm(formula = log(charges) ~ age + bmi + children + gender + smoker +
## region, data = data.without.influential)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.84090 -0.10713 0.00638 0.07927 1.15633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.734615 0.018169 480.741 < 2e-16 ***
## age 0.571294 0.007954 71.824 < 2e-16 ***
## bmi 0.057434 0.008209 6.997 4.30e-12 ***
## children 0.135800 0.007738 17.550 < 2e-16 ***
## genderfemale 0.091397 0.015467 5.909 4.45e-09 ***
## smokeryes 1.600646 0.019773 80.951 < 2e-16 ***
## regionnorthwest -0.077126 0.022220 -3.471 0.000536 ***
## regionsoutheast -0.141325 0.022318 -6.332 3.38e-10 ***
## regionsouthwest -0.116561 0.022200 -5.250 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2708 on 1227 degrees of freedom
## Multiple R-squared: 0.9096, Adjusted R-squared: 0.909
## F-statistic: 1543 on 8 and 1227 DF, p-value: < 2.2e-16
We check once again the Q-Q plot
plot(log.model.no.outliers, which=2)
In summary, the model improved (R^2 = 0.91) and the coefficients are all significant.
Identify any influential observations in the wine.data dataframe by calculating the Cook’s Distances. Start first by plotting the graph
plot(wine.model, which=4)
Calculate Cook’s distance values of each observation of the model wine.model using the function cooks.distance(). Then, use the rule of thumb to identify the influential observations.
wine.model.cook.distances <- cooks.distance(wine.model)
influential_obs <- wine.model.cook.distances[(wine.model.cook.distances) > 4/nrow(wine.data)]
influential_obs <- as.numeric(names(influential_obs))
Then, remove the influential observations of the wine.data and create a new model
wine.data.without.influential <- wine.data[-influential_obs,]
wine.model.no.outliers <- lm(quality ~ ., wine.data.without.influential)
summary(wine.model.no.outliers)
##
## Call:
## lm(formula = quality ~ ., data = wine.data.without.influential)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.67161 -0.36353 -0.05916 0.39741 1.85537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.65496 0.01439 393.023 < 2e-16 ***
## fixed_acidity 0.10537 0.04270 2.468 0.0137 *
## volatile_acidity -0.17616 0.02063 -8.539 < 2e-16 ***
## citric_acid -0.06540 0.02640 -2.477 0.0134 *
## residual_sugar 0.05231 0.02150 2.433 0.0151 *
## chlorides -0.08379 0.01966 -4.262 2.16e-05 ***
## free_sulfur_dioxide 0.02744 0.02089 1.314 0.1891
## total_sulfur_dioxide -0.10657 0.02257 -4.722 2.55e-06 ***
## density -0.06528 0.03807 -1.715 0.0866 .
## pH -0.02644 0.02733 -0.967 0.3335
## sulphates 0.18965 0.01955 9.701 < 2e-16 ***
## alcohol 0.27940 0.02615 10.684 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5553 on 1491 degrees of freedom
## Multiple R-squared: 0.4272, Adjusted R-squared: 0.423
## F-statistic: 101.1 on 11 and 1491 DF, p-value: < 2.2e-16
We check once again the Q-Q plot
plot(wine.model.no.outliers, which=2)
Much better!
The most common type of misspecification is non-linearity. We can check whether the model is correctly specified by plotting the dependent variable against each independent variable. Since the model was built with several independent variables, we need to control for the other independent variables while we vary the one under inspection. To do this, we use added variable plots. These plots display the relationship between the dependent variable and one independent variable in the regression model, while holding the other independent variables constant. They are also called partial regression plots.
To create added variable plots in R, we can use the avPlot() function from the car package for each numeric variable:
avPlot(log.model.no.outliers,"age")
avPlot(log.model.no.outliers,"bmi")
avPlot(log.model.no.outliers,"children")
The x-axis displays a single predictor variable and the y-axis displays the response variable. The blue line shows the association between the predictor variable and the response variable, while holding the value of all other predictor variables constant. The points that are labelled in each plot represent the two observations with the largest residuals and the two observations with the largest partial leverage.
After checking the plots for the logarithmic model, we can see that there is a clear linear relationship between each numeric variable and the dependent variable log(charges).
Multicollinearity occurs when independent variables in a regression model are correlated, i.e., one independent variable can be predicted from another. This correlation is a problem because the independent variables should be independent of each other; the redundant information skews the results of the regression. If the degree of correlation is high enough, it can cause problems when you fit the model and interpret the results.
The stronger the correlation, the more difficult it is to change one variable without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable independently because the independent variables tend to change in unison.
Some examples of correlated independent variables are: a person’s height and weight, the age and sale price of a car, or years of education and annual income. Multicollinearity causes two basic types of problems:
- The coefficient estimates become unstable: they can change substantially, even flipping sign, when other variables are added to or removed from the model.
- The precision of the estimates is reduced: standard errors are inflated, which weakens the statistical power of the model and makes the p-values untrustworthy.
The variance inflation factor (VIF) test identifies correlation between independent variables and the strength of that correlation. A common rule of thumb is to declare multicollinearity if most of the VIF are larger than 10.
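For predictor \(j\), \(\mathrm{VIF}_j = 1/(1 - R_j^2)\), where \(R_j^2\) is the coefficient of determination from regressing predictor \(j\) on all the other predictors. A hand-computed sketch on simulated correlated predictors (illustrative; car::vif() does this for you):

```r
# Illustrative sketch (simulated data): compute a VIF by hand as
# 1 / (1 - R^2) of the regression of one predictor on the others.
set.seed(6)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = sqrt(1 - 0.9^2))   # strongly correlated with x1
x3 <- rnorm(n)                                     # independent of the rest
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}
vif_manual(x1, data.frame(x2, x3))   # well above 1: x1 is predictable from x2
vif_manual(x3, data.frame(x1, x2))   # close to 1: x3 carries its own information
```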
car::vif(log.model.no.outliers)
## GVIF Df GVIF^(1/(2*Df))
## age 1.030850 1 1.015308
## bmi 1.123649 1 1.060023
## children 1.005972 1 1.002982
## gender 1.007666 1 1.003826
## smoker 1.014312 1 1.007130
## region 1.101291 3 1.016211
In this example, the VIF scores are all lower than 2. Therefore, there are no signs of multicollinearity in this model.
Check whether the wine.model.no.outliers has multicollinearity.
car::vif(wine.model.no.outliers)
## fixed_acidity volatile_acidity citric_acid
## 8.062905 1.891349 3.261109
## residual_sugar chlorides free_sulfur_dioxide
## 1.681510 1.432694 2.053849
## total_sulfur_dioxide density pH
## 2.301593 6.464433 3.343257
## sulphates alcohol
## 1.435240 3.145932